May 27, 2015

Background

Myself

Reed College

  • Baby hipsters going to baby graduate school
  • Small (1,400 students), undergraduate-only liberal arts college
    • Smaller class sizes
    • Very motivated and quirky student body
    • Socially progressive but academically conservative
  • Established in 1908 because the founder:
    • Didn't care for sports
    • Despised social clubs
  • Huge de-emphasis on grades
  • High degree of interaction between departments

Curriculum

  • Not a vocational school
  • Among the schools with the highest proportion of students who go on to earn a PhD
  • All students must:
    • take a junior qualifying exam
    • write a senior thesis
    • take the freshman humanities class: HUM 110

HUM 110

Statistics at Reed

  • Statistics is within the math department.
  • Classes:
    • Introductory statistics
    • Year-long junior level probability & mathematical statistics sequence
    • A few methods classes in other departments
  • Until this semester, no applied stats class taught by a statistician

New Class in Spring 2015

  • MATH241: Case Studies in Statistical Analysis
  • Should have called it "Introduction to Data Science"

The Bigger Picture

Similar Class

Former Googler Rachel Schutt taught a similar data science course at Columbia University.

Class Description

Class Structure

Prereqs

Only intro stats (knowing what a standard error and a regression are) and some exposure to R

Syllabus

  • 6-7 biweekly mini-reports
    • varying degree of open-endedness
    • submitted in R Markdown to encourage reproducible research
    • Large amount of feedback from me
  • Term project: both report and 20 min oral presentation
  • In-class participation

Classroom

Demographics

18 students, mostly juniors and seniors.

Major                                                             Count
Mathematics                                                       4
Biological Science: Biology & Biochem and Molecular Biology       4
Other Science: Chemistry, Environmental Studies, Physics          4
Social Science: Political Science, Sociology                      2
Economics                                                         2
Misc: Psychology, Linguistics                                     2

Principles

ASA's GAISE Reports

  • Use real data.
  • Stress conceptual understanding, rather than mere knowledge of procedures.
  • Foster active learning in the classroom.
  • Use technology for developing conceptual understanding and analyzing data.

In Practice

  • Messy data, from potentially disparate sources
  • Bottom-up: Let questions/data motivate the statistical methodology, rather than vice-versa
  • Discussions in class
  • Lean on R heavily
  • Focus on the entire analysis pipeline: article in Nature

Tools

Environment: RStudio

How to get students to use R?

  • Key: Forget Base R
  • How? The Hadleyverse.
  • In particular
    • dplyr package for data wrangling
    • ggplot2 package for data visualization

Data Frame

We require that all our data live in a rectangular table called a data frame, which we say has the "tidy" property: each row is an observation and each column is a variable.

dplyr Verbs

Most data manipulations can be achieved by applying the following verbs to a "tidy" data frame:

  1. filter: keep rows matching criteria
  2. summarise: reduce variables to values
  3. mutate: add new variables
  4. arrange: reorder rows
  5. select: pick columns by name
  6. join: join two data frames
  7. group_by: group subsets of observations together
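
As a minimal sketch of how these verbs chain together, the following uses R's built-in mtcars data set rather than any data from the class. The columns mpg, cyl, and wt come from mtcars; the derived names kpl, mean_kpl, and n_cars are made up for illustration. The join verb is left out of this sketch.

```r
library(dplyr)

# mtcars ships with R: one row per car model, one column per variable.
efficiency_by_cyl <- mtcars %>%
  filter(wt < 4) %>%                 # filter: keep cars under 4,000 lbs (wt is in 1000s)
  select(mpg, cyl, wt) %>%           # select: pick columns by name
  mutate(kpl = mpg * 0.425) %>%      # mutate: add approximate km per litre
  group_by(cyl) %>%                  # group_by: one group per cylinder count
  summarise(mean_kpl = mean(kpl),    # summarise: reduce each group to one row
            n_cars = n()) %>%
  arrange(desc(mean_kpl))            # arrange: most efficient group first

print(efficiency_by_cyl)
```

Each line does one thing to the data frame handed to it, so students can run the chain one verb at a time and inspect the intermediate result.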

dplyr Piping

The pipe operator %>% is pronounced "then".

For example, say you want to apply the functions h(), then g(), then f() to data x. You can write

  • f(g(h(x))) OR
  • h(x) %>% g() %>% f()
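
To make the equivalence concrete, here is a small sketch in which base R functions stand in for h(), g(), and f(). The choice of sqrt(), sum(), and round() is purely illustrative.

```r
library(magrittr)  # provides %>%; dplyr re-exports the same operator

x <- c(1, 4, 9, 16)

# Nested form: read inside-out (sqrt, then sum, then round).
nested <- round(sum(sqrt(x)))

# Piped form: read left to right as "sqrt of x, then sum, then round".
piped <- sqrt(x) %>% sum() %>% round()

# Both forms compute the same value.
print(identical(nested, piped))
```

The piped form lists the steps in the order they are carried out, which tends to be easier to read as the chain grows.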

Example

ggplot2: the Grammar of Graphics

A statistical graphic consists of a mapping of variables in data to aesthetic attributes of geometric objects that we can observe.

ggplot2 allows us to construct graphics in a modular fashion by specifying these components.

Data (Variable)    Aesthetic                 Geometric Object
longitude          x position                points
latitude           y position                points
army size          size = width              bars
army direction     color = brown or black    bars
date               (x,y) position            text
temperature        (x,y) position            lines
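
The same data / aesthetic / geometric object breakdown can be sketched in ggplot2 code. This example uses the built-in mtcars data set as a stand-in, not the data behind the table above; the axis labels are my own wording.

```r
library(ggplot2)

# Data: the built-in mtcars data frame.
# Aesthetic mappings: wt -> x position, mpg -> y position, cyl -> color.
# Geometric object: points.
p <- ggplot(data = mtcars,
            mapping = aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 3) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon", color = "Cylinders")

# Adding another geom changes the graphic without touching the mappings.
p_with_trend <- p + geom_smooth(method = "lm", se = FALSE)

print(p)
```

Because each component is specified separately, one piece (say, the geom) can be swapped while the rest of the specification stays the same.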

Results

HW 1: Houston Flight Data

Dataset consisting of all 227,496 domestic flights leaving Houston airport (IAH) in 2011. Four data frames:

  • flights: flight info
  • weather: hourly weather info
  • planes: information on all 2853 airplanes
  • airports: destination airport information
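
Analyses that span several of these data frames rely on joins. The sketch below uses hypothetical miniature stand-ins for flights and planes; the column names (tailnum, dest, dep_delay, year) and all values are assumptions made for illustration, not the actual homework data.

```r
library(dplyr)

# Hypothetical miniature stand-ins; names and values are invented.
flights <- data.frame(
  tailnum   = c("N1", "N1", "N2", "N3"),
  dest      = c("PDX", "SFO", "PDX", "JFK"),
  dep_delay = c(5, 20, -3, 45),
  stringsAsFactors = FALSE
)
planes <- data.frame(
  tailnum = c("N1", "N2"),
  year    = c(1998, 2007),
  stringsAsFactors = FALSE
)

# left_join keeps every flight; flights whose plane is missing get NA for year.
flights_planes <- left_join(flights, planes, by = "tailnum")

# Counting unmatched rows right after a join catches many silent errors.
n_unmatched <- sum(is.na(flights_planes$year))

# Mean departure delay by the plane's year of manufacture.
delay_by_year <- flights_planes %>%
  filter(!is.na(year)) %>%
  group_by(year) %>%
  summarise(mean_delay = mean(dep_delay))

print(delay_by_year)
```

A left join is used here so that no flight is dropped silently; an inner join would discard the unmatched row without warning.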

HW 1: Delayed Flights

Rennie Meyers

HW 1: Age of Airplanes

Will Jones

HW 1: Destination Cities

HW 2: OkCupid Data

Sample of 10% of San Francisco OkCupid users in June 2012 (\(n=5995\)). 40.2% of the population was female.

Goal was to use logistic regression to predict gender.
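
A minimal sketch of this kind of model, fit with base R's glm(). The data here are simulated, not the OkCupid sample; the variable names (is_female, height) and all simulation parameters are assumptions chosen only to make the example run.

```r
set.seed(76)

# Simulated stand-in data: invented values, not the OkCupid sample.
n <- 500
is_female <- rbinom(n, size = 1, prob = 0.4)
height <- rnorm(n, mean = ifelse(is_female == 1, 65, 70), sd = 3)  # inches
profiles <- data.frame(is_female, height)

# Logistic regression: log-odds of is_female modeled as linear in height.
fit <- glm(is_female ~ height, data = profiles, family = binomial)

# Fitted probabilities, then a simple 0.5 threshold to classify.
profiles$p_hat <- predict(fit, type = "response")
profiles$predicted <- as.integer(profiles$p_hat > 0.5)

# Proportion classified correctly (in-sample, so optimistic).
accuracy <- mean(profiles$predicted == profiles$is_female)

print(coef(fit))
print(accuracy)
```

The 0.5 threshold is one arbitrary choice; with an unbalanced outcome, comparing against the "always guess the majority class" baseline is a useful sanity check.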

HW 2: Listed Job

Miguel Connor

HW 2: Self-Referenced Body Type

Many students

Final Project

Shiny App

  • Houston Flights data
  • OkCupid data
  • Reed jukebox data: Talking Heads
  • Elections data
  • Quandl
  • Babynames datasets
  • Census data (I love FIPS codes!) joined with health data joined with shapefile data

Topics

  • Regressions, time series, spatial autocorrelation (show image), density plots (prostitution in Portland), text mining, sampling bias, First law of geography
  • Webscraping using the rvest package
  • GitHub and R Markdown for reproducibility
  • Good programming practice (Google style guide)
  • GitHub for promoting/sharing work
  • Dates with lubridate
  • String manipulation and basic regular expressions with stringr
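
As a rough sketch of the last two items, the following parses timestamps with lubridate and pulls fields out of text with stringr. The log lines, their format, and the variable names are invented for illustration and are not the actual jukebox data.

```r
library(lubridate)
library(stringr)

# Hypothetical log lines; the format and values are invented.
plays <- c("2014-03-01 23:30:00 | Talking Heads - Psycho Killer",
           "2014-03-02 01:15:00 | Talking Heads - Once in a Lifetime")

# stringr: extract the timestamp and the artist with regular expressions.
stamp  <- str_extract(plays, "^\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}")
artist <- str_trim(str_extract(plays, "(?<=\\|)[^-]+"))

# lubridate: parse as UTC, then view the same instants in Pacific time.
played_utc   <- ymd_hms(stamp, tz = "UTC")
played_local <- with_tz(played_utc, tzone = "America/Los_Angeles")

# The hour of day shifts once the time zone is accounted for.
print(data.frame(artist,
                 utc_hour   = hour(played_utc),
                 local_hour = hour(played_local)))
```

with_tz() changes only how the instant is displayed, not the instant itself, which is usually what is wanted when a log was recorded in one zone and is analyzed in another.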

Examples:

  • Importance of exploratory data analysis
    • Time zones of Reed Jukebox
    • Erroneous join of counties in US
  • Miguel's quandl analysis

Student Comments

  • One econ student who used to use Stata is now building a Shiny app for his senior thesis.
  • One biology major called it the most useful class he's taken at Reed.
  • One biology major said that if she had taken this class earlier in her career, it would have convinced her to become a math major.

The Future

Areas for Improvement

  • Statistical topics: More on dependent data, missing data, causal inference, and some machine learning
  • Databases/SQL
  • Ask better questions of the students
  • Flipped-classroom
    • Lab exercises at home
    • Problem solving/debugging and discussion in class

Statistics' Image Problem

  • Typical conversation
    • Me (statistician): "Hi, my name is Albert. I'm a statistician."
    • Other party: "Statistics? I hated that class."
  • Atypical conversation
    • Me (statistician): "Hi, my name is Albert. My work involves a lot of data visualization."
    • Other party: "Data visualization? I hate that stuff."

Solution: Data Visualization

Data visualization is a backdoor way to get students interested in statistics.

"Prez" from Season 4 of "The Wire"

Trick them into thinking they're not learning, and they do.

Impact on my Intro Stats Classes

Issue: Programming

  • Point-and-click vs command line.
  • Thinking algorithmically
  • Debugging: help files and Google
  • Like learning a language

Conclusions

Take Home Messages

  • A statistics class focused on the data
  • Rich Majerus wrote "Why should students at a small liberal arts college learn R?"
    • He learned R using dplyr and ggplot, not base R.
    • New tools like DataCamp are increasing the ratio: \[\frac{\mbox{Payoff from learning R}}{\mbox{Startup costs}}\]
  • Data visualization as a "backdoor entrance" for statistics
  • Developing skills and intuition takes time. At Reed, classes are small, which allows for individual attention and feedback
  • Interactivity boosts student interest

Google Wisdom Imparted to Students

Presentation on 2011/06/27 given by Dierdre and Amir:

  • Look at your data ASAP.
  • Don't thrash
    • Do your due diligence, but don't overdo it.
    • Seek expert advice!
  • "You actually don’t know what you are doing until after you have done it!"
  • Do the most braindead thing first, take it end to end, then iterate and improve.
  • Fight perfectionism: think of the marginal return of your efforts.

Conclusion